Intro
Our interest in epidemiology led us to explore the factors associated with dementia. Given the increasing prevalence of Alzheimer's disease in the U.S., its severity, and the lack of a cure, it is imperative to understand the factors that contribute to the disease so that risk can be addressed from an early age. While this topic has been explored before, we sought to use the OASIS (Open Access Series of Imaging Studies) dataset to build a model that estimates the probability of dementia from a set of subject-level variables.
dement <- data.frame(read_csv("~/Desktop/oasis_longitudinal.csv"))
## Parsed with column specification:
## cols(
## `Subject ID` = col_character(),
## `MRI ID` = col_character(),
## Group = col_character(),
## Visit = col_integer(),
## `MR Delay` = col_integer(),
## `M/F` = col_character(),
## Hand = col_character(),
## Age = col_integer(),
## EDUC = col_integer(),
## SES = col_integer(),
## MMSE = col_integer(),
## CDR = col_double(),
## eTIV = col_integer(),
## nWBV = col_double(),
## ASF = col_double()
## )
head(dement)
## Subject.ID MRI.ID Group Visit MR.Delay M.F Hand Age EDUC
## 1 OAS2_0001 OAS2_0001_MR1 Nondemented 1 0 M R 87 14
## 2 OAS2_0001 OAS2_0001_MR2 Nondemented 2 457 M R 88 14
## 3 OAS2_0002 OAS2_0002_MR1 Demented 1 0 M R 75 12
## 4 OAS2_0002 OAS2_0002_MR2 Demented 2 560 M R 76 12
## 5 OAS2_0002 OAS2_0002_MR3 Demented 3 1895 M R 80 12
## 6 OAS2_0004 OAS2_0004_MR1 Nondemented 1 0 F R 88 18
## SES MMSE CDR eTIV nWBV ASF
## 1 2 27 0.0 1987 0.696 0.883
## 2 2 30 0.0 2004 0.681 0.876
## 3 NA 23 0.5 1678 0.736 1.046
## 4 NA 28 0.5 1738 0.713 1.010
## 5 NA 22 0.5 1698 0.701 1.034
## 6 3 28 0.0 1215 0.710 1.444
str(dement)
## 'data.frame': 373 obs. of 15 variables:
## $ Subject.ID: chr "OAS2_0001" "OAS2_0001" "OAS2_0002" "OAS2_0002" ...
## $ MRI.ID : chr "OAS2_0001_MR1" "OAS2_0001_MR2" "OAS2_0002_MR1" "OAS2_0002_MR2" ...
## $ Group : chr "Nondemented" "Nondemented" "Demented" "Demented" ...
## $ Visit : int 1 2 1 2 3 1 2 1 2 3 ...
## $ MR.Delay : int 0 457 0 560 1895 0 538 0 1010 1603 ...
## $ M.F : chr "M" "M" "M" "M" ...
## $ Hand : chr "R" "R" "R" "R" ...
## $ Age : int 87 88 75 76 80 88 90 80 83 85 ...
## $ EDUC : int 14 14 12 12 12 18 18 12 12 12 ...
## $ SES : int 2 2 NA NA NA 3 3 4 4 4 ...
## $ MMSE : int 27 30 23 28 22 28 27 28 29 30 ...
## $ CDR : num 0 0 0.5 0.5 0.5 0 0 0 0.5 0 ...
## $ eTIV : int 1987 2004 1678 1738 1698 1215 1200 1689 1701 1699 ...
## $ nWBV : num 0.696 0.681 0.736 0.713 0.701 0.71 0.718 0.712 0.711 0.705 ...
## $ ASF : num 0.883 0.876 1.046 1.01 1.034 ...
The dataset has 15 variables and 373 observations. The response we are looking to predict is the Group variable, with values "Nondemented" and "Demented"; it will be recoded numerically, with 1 for "Demented" and 0 for "Nondemented". The Subject.ID and MRI.ID columns only identify the subjects and scans, so they are of no predictive use. The Hand variable is also unhelpful, since every subject in the dataset is right-handed. The Visit column is dropped as well, since it is a time-related variable that could confound the results. Finally, the CDR (Clinical Dementia Rating) variable is a scaled measure of the degree of dementia; we drop it because we are only concerned with the presence of dementia, not how severely it manifests. The remaining variables are all attributes of the subject, providing a good basis for predicting Alzheimer's. The M/F variable will be recoded as 1 for male and 0 for female so it can be used in the regression, and all observations containing an NA will be removed so that only complete cases are used for fitting and testing the regression.
#drop unnecessary columns
drops <- c("Subject.ID","MRI.ID","Hand","Visit","CDR")
alzheimers <- dement[,!(names(dement) %in% drops)]
#code nondemented -> 0 and demented -> 1
alzheimers$Dement <- as.numeric(factor(dement$Group,levels=c("Nondemented","Demented"))) - 1
#code female -> 0 and male -> 1
alzheimers$Sex <- as.numeric(factor(dement$M.F,levels=c("F","M"))) - 1
#drop old categorical demented/sex columns
drops2 <- c("Group","M.F")
alzheimers <- alzheimers[,!(names(alzheimers) %in% drops2)]
#drop rows with NA values
alzheimers <- alzheimers[complete.cases(alzheimers),]
head(alzheimers)
## MR.Delay Age EDUC SES MMSE eTIV nWBV ASF Dement Sex
## 1 0 87 14 2 27 1987 0.696 0.883 0 1
## 2 457 88 14 2 30 2004 0.681 0.876 0 1
## 6 0 88 18 3 28 1215 0.710 1.444 0 0
## 7 538 90 18 3 27 1200 0.718 1.462 0 0
## 8 0 80 12 4 28 1689 0.712 1.039 0 1
## 9 1010 83 12 4 29 1701 0.711 1.032 0 1
nrow(alzheimers)
## [1] 317
The dataset now consists of 317 observations of 10 variables after removing unnecessary variables and NA values and converting factor variables to numeric types. Now that the data is in a usable format, we can do an exploratory analysis of the variables available. Let's look at a summary of the variables as well as visual representations of each one.
summary(alzheimers)
## MR.Delay Age EDUC SES
## Min. : 0.0 Min. :60.00 Min. : 6.00 Min. :1.000
## 1st Qu.: 0.0 1st Qu.:71.00 1st Qu.:12.00 1st Qu.:2.000
## Median : 539.0 Median :76.00 Median :15.00 Median :2.000
## Mean : 581.5 Mean :76.72 Mean :14.62 Mean :2.546
## 3rd Qu.: 854.0 3rd Qu.:82.00 3rd Qu.:16.00 3rd Qu.:3.000
## Max. :2517.0 Max. :98.00 Max. :23.00 Max. :5.000
## MMSE eTIV nWBV ASF
## Min. : 4.00 Min. :1106 Min. :0.6440 Min. :0.876
## 1st Qu.:27.00 1st Qu.:1358 1st Qu.:0.7000 1st Qu.:1.098
## Median :29.00 Median :1476 Median :0.7320 Median :1.189
## Mean :27.26 Mean :1494 Mean :0.7306 Mean :1.192
## 3rd Qu.:30.00 3rd Qu.:1599 3rd Qu.:0.7570 3rd Qu.:1.293
## Max. :30.00 Max. :2004 Max. :0.8370 Max. :1.587
## Dement Sex
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.4006 Mean :0.4322
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
The data is randomized before being divided into training and test sets. We then explore the correlations in the randomized Alzheimer's dataset.
set.seed(2000) #seed the RNG before sampling so the shuffle is reproducible
alzheimers_rand <- alzheimers[sample(nrow(alzheimers)),]
cor(alzheimers_rand)
## MR.Delay Age EDUC SES MMSE
## MR.Delay 1.000000000 0.187725906 0.01625784 0.002416405 0.07969187
## Age 0.187725906 1.000000000 -0.04538631 -0.005011991 0.04605156
## EDUC 0.016257842 -0.045386312 1.00000000 -0.733018061 0.18507481
## SES 0.002416405 -0.005011991 -0.73301806 1.000000000 -0.13521876
## MMSE 0.079691871 0.046051556 0.18507481 -0.135218759 1.00000000
## eTIV 0.119530283 0.037093288 0.26858492 -0.289781242 -0.02063041
## nWBV -0.066573781 -0.497126215 0.01591187 0.048501515 0.37071410
## ASF -0.123866458 -0.024314516 -0.25229797 0.282656733 0.03169271
## Dement -0.174719076 -0.053649336 -0.22056522 0.164714936 -0.62328206
## Sex 0.035934196 -0.054018321 0.04080560 -0.027069290 -0.17487987
## eTIV nWBV ASF Dement Sex
## MR.Delay 0.11953028 -0.06657378 -0.123866458 -0.174719076 0.03593420
## Age 0.03709329 -0.49712621 -0.024314516 -0.053649336 -0.05401832
## EDUC 0.26858492 0.01591187 -0.252297965 -0.220565216 0.04080560
## SES -0.28978124 0.04850151 0.282656733 0.164714936 -0.02706929
## MMSE -0.02063041 0.37071410 0.031692707 -0.623282062 -0.17487987
## eTIV 1.00000000 -0.19507524 -0.988639123 -0.013106336 0.55770856
## nWBV -0.19507524 1.00000000 0.197789790 -0.331291545 -0.21454685
## ASF -0.98863912 0.19778979 1.000000000 0.004758789 -0.54714350
## Dement -0.01310634 -0.33129154 0.004758789 1.000000000 0.27437569
## Sex 0.55770856 -0.21454685 -0.547143500 0.274375689 1.00000000
pairs(alzheimers_rand)
Now we can begin to search for predictors of Alzheimer's (observations in the "Demented" group in the original dataset) using regression models. For this question, we choose logistic regression: the outcome we are trying to predict is binary in nature, simply whether the patient is demented or not, and logistic regression lets us build a model that outputs a probability of dementia from a number of input variables. In addition, our dataset is large enough to divide into training and test sets. Of the 317 observations, we will put 253 in the training set and 64 in the test set. We will use the training set to fit the model and the test set to assess its accuracy.
set.seed(2000)
train <- alzheimers_rand[1:253,]
test <- alzheimers_rand[254:317,]
nrow(train)
## [1] 253
nrow(test)
## [1] 64
We first run forward subset selection on a linear model to see which variables appear most important.
loadPkg("leaps")
loadPkg("ISLR")
reg.forward <- regsubsets(Dement~., data = alzheimers , method = "forward", nvmax = 11, nbest= 1)
plot(reg.forward, scale = "adjr2", main = "Adjusted R^2")
plot(reg.forward, scale = "bic", main = "BIC")
plot(reg.forward, scale = "Cp", main = "Cp")
summary(reg.forward)
## Subset selection object
## Call: regsubsets.formula(Dement ~ ., data = alzheimers, method = "forward",
## nvmax = 11, nbest = 1)
## 9 Variables (and intercept)
## Forced in Forced out
## MR.Delay FALSE FALSE
## Age FALSE FALSE
## EDUC FALSE FALSE
## SES FALSE FALSE
## MMSE FALSE FALSE
## eTIV FALSE FALSE
## nWBV FALSE FALSE
## ASF FALSE FALSE
## Sex FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: forward
## MR.Delay Age EDUC SES MMSE eTIV nWBV ASF Sex
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*" " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*" " " " " "*"
## 4 ( 1 ) "*" " " " " " " "*" "*" " " " " "*"
## 5 ( 1 ) "*" " " " " " " "*" "*" "*" " " "*"
## 6 ( 1 ) "*" " " "*" " " "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
Let us begin with the full logistic regression using all variables.
model <- glm(Dement~.,family=binomial(link="logit"),data=train)
summary(model)
##
## Call:
## glm(formula = Dement ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.47436 -0.47016 -0.17855 0.07961 2.40224
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.384e+01 2.397e+01 2.246 0.024684 *
## MR.Delay -2.913e-04 3.801e-04 -0.766 0.443391
## Age -1.608e-01 4.228e-02 -3.802 0.000143 ***
## EDUC -1.793e-01 1.281e-01 -1.400 0.161536
## SES -4.400e-01 3.082e-01 -1.428 0.153314
## MMSE -1.217e+00 2.017e-01 -6.032 1.62e-09 ***
## eTIV 4.263e-03 7.192e-03 0.593 0.553312
## nWBV -3.218e+01 8.269e+00 -3.892 9.94e-05 ***
## ASF 1.038e+01 9.607e+00 1.081 0.279755
## Sex 1.829e+00 5.904e-01 3.097 0.001955 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.95 on 252 degrees of freedom
## Residual deviance: 144.11 on 243 degrees of freedom
## AIC: 164.11
##
## Number of Fisher Scoring iterations: 7
Now we will use step-wise variable selection to refine the model. This method should result in a better AIC for the final model, which is described at the bottom of the code output.
model.null = glm(Dement ~ 1, data=train, family= binomial(link = "logit"))
model.full = glm(Dement ~ Age+ MMSE + eTIV + nWBV + Sex + MR.Delay + EDUC + SES + ASF, data=train, family = binomial(link="logit"))
step(model.full,scope = list(upper=model.full), direction="both", test="Chisq", data=train)
## Start: AIC=164.11
## Dement ~ Age + MMSE + eTIV + nWBV + Sex + MR.Delay + EDUC + SES +
## ASF
##
## Df Deviance AIC LRT Pr(>Chi)
## - eTIV 1 144.46 162.46 0.347 0.555859
## - MR.Delay 1 144.70 162.70 0.590 0.442350
## - ASF 1 145.25 163.25 1.140 0.285600
## <none> 144.11 164.11
## - EDUC 1 146.19 164.19 2.083 0.148983
## - SES 1 146.24 164.24 2.128 0.144654
## - Sex 1 154.75 172.75 10.638 0.001108 **
## - nWBV 1 161.51 179.51 17.397 3.033e-05 ***
## - Age 1 161.57 179.57 17.460 2.934e-05 ***
## - MMSE 1 255.35 273.35 111.244 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=162.46
## Dement ~ Age + MMSE + nWBV + Sex + MR.Delay + EDUC + SES + ASF
##
## Df Deviance AIC LRT Pr(>Chi)
## - MR.Delay 1 145.12 161.12 0.658 0.417211
## - EDUC 1 146.32 162.32 1.864 0.172216
## <none> 144.46 162.46
## - SES 1 146.50 162.50 2.046 0.152626
## + eTIV 1 144.11 164.11 0.347 0.555859
## - ASF 1 149.95 165.95 5.489 0.019134 *
## - Sex 1 155.23 171.23 10.775 0.001029 **
## - Age 1 161.57 177.57 17.115 3.519e-05 ***
## - nWBV 1 161.82 177.82 17.365 3.085e-05 ***
## - MMSE 1 255.38 271.38 110.925 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=161.12
## Dement ~ Age + MMSE + nWBV + Sex + EDUC + SES + ASF
##
## Df Deviance AIC LRT Pr(>Chi)
## - EDUC 1 147.11 161.11 1.997 0.157577
## <none> 145.12 161.12
## - SES 1 147.47 161.47 2.360 0.124501
## + MR.Delay 1 144.46 162.46 0.658 0.417211
## + eTIV 1 144.70 162.70 0.415 0.519490
## - ASF 1 150.90 164.90 5.785 0.016159 *
## - Sex 1 155.69 169.69 10.578 0.001144 **
## - nWBV 1 162.37 176.37 17.259 3.261e-05 ***
## - Age 1 163.17 177.17 18.057 2.144e-05 ***
## - MMSE 1 261.05 275.05 115.939 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=161.11
## Dement ~ Age + MMSE + nWBV + Sex + SES + ASF
##
## Df Deviance AIC LRT Pr(>Chi)
## - SES 1 147.66 159.66 0.550 0.458138
## <none> 147.11 161.11
## + EDUC 1 145.12 161.12 1.997 0.157577
## + MR.Delay 1 146.32 162.32 0.792 0.373514
## + eTIV 1 146.95 162.95 0.162 0.687400
## - ASF 1 152.73 164.73 5.614 0.017813 *
## - Sex 1 157.62 169.62 10.512 0.001186 **
## - Age 1 164.02 176.02 16.910 3.919e-05 ***
## - nWBV 1 164.17 176.17 17.056 3.630e-05 ***
## - MMSE 1 266.74 278.74 119.627 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=159.66
## Dement ~ Age + MMSE + nWBV + Sex + ASF
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 147.66 159.66
## + MR.Delay 1 146.72 160.72 0.942 0.331714
## + SES 1 147.11 161.11 0.550 0.458138
## + eTIV 1 147.44 161.44 0.225 0.634948
## + EDUC 1 147.47 161.47 0.188 0.664583
## - ASF 1 152.78 162.78 5.118 0.023682 *
## - Sex 1 157.64 167.64 9.978 0.001584 **
## - Age 1 164.25 174.25 16.583 4.657e-05 ***
## - nWBV 1 164.71 174.71 17.051 3.640e-05 ***
## - MMSE 1 269.58 279.58 121.915 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: glm(formula = Dement ~ Age + MMSE + nWBV + Sex + ASF, family = binomial(link = "logit"),
## data = train)
##
## Coefficients:
## (Intercept) Age MMSE nWBV Sex
## 61.641 -0.151 -1.202 -30.676 1.630
## ASF
## 4.103
##
## Degrees of Freedom: 252 Total (i.e. Null); 247 Residual
## Null Deviance: 342
## Residual Deviance: 147.7 AIC: 159.7
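Since the response is binary, the residual deviance of a binomial GLM equals -2 times the log-likelihood, so each AIC printed above is simply the deviance plus 2 times the number of estimated coefficients. A quick sanity check against the final step:

```r
# AIC = deviance + 2k for a binary-response binomial GLM
deviance_final <- 147.66  # residual deviance of the step-selected model
k <- 6                    # intercept + Age + MMSE + nWBV + Sex + ASF
aic_final <- deviance_final + 2 * k
aic_final                 # 159.66, matching the AIC reported by step()
```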
The model found by step-wise selection (Dement ~ Age + MMSE + nWBV + Sex + ASF) has a slightly better AIC than the original full model containing all variables. We store our final model in a variable called model.final; note that it substitutes EDUC for the ASF term chosen by the step-wise procedure, since ASF is nearly collinear with eTIV (r of about -0.99 in the correlation matrix above) and acts as a head-size scaling factor rather than a subject attribute. This is the model we will use to make predictions on the test dataset.
model.final = glm(Dement ~ Age + MMSE + nWBV + Sex + EDUC, data=train, family = binomial(link="logit"))
summary(model.final)
##
## Call:
## glm(formula = Dement ~ Age + MMSE + nWBV + Sex + EDUC, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31776 -0.48394 -0.22212 0.09347 2.26070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 65.53741 11.11837 5.895 3.76e-09 ***
## Age -0.14688 0.03959 -3.710 0.000207 ***
## MMSE -1.21295 0.19171 -6.327 2.50e-10 ***
## nWBV -27.25496 7.57551 -3.598 0.000321 ***
## Sex 0.93899 0.42491 2.210 0.027113 *
## EDUC -0.08089 0.07558 -1.070 0.284477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.95 on 252 degrees of freedom
## Residual deviance: 151.63 on 247 degrees of freedom
## AIC: 163.63
##
## Number of Fisher Scoring iterations: 7
plot(model.final)
Interpretation
The coefficient estimates give the model's parameters on the log-odds scale. The intercept estimate of 65.537 is the log-odds when all of the predictors are zero. For every one-unit change in a predictor (Age, MMSE, nWBV, Sex, EDUC), the log-odds increases or decreases by the corresponding estimate: a one-year increase in age decreases the log-odds by 0.147, a one-point increase in MMSE decreases it by 1.213, a one-unit increase in nWBV decreases it by 27.255, being male increases it by 0.939, and each additional year of education decreases it by 0.081. The standard error estimates the variability of each coefficient due to sampling; for example, the standard error for Age is 0.040 and for nWBV is 7.576. According to the p-values and significance codes, Age, MMSE, and nWBV have a stronger association with Dement than Sex does, while EDUC is not significant. The null deviance and residual deviance show how the final model compares against the null model, which contains only the intercept. model.final has a null deviance of 341.95 and a residual deviance of 151.63; the large gap between the two means the final model fits far better than the null model. Adding a useful variable lowers the residual deviance, but every added parameter also incurs an AIC penalty; the final model balances the two, with a low AIC and a substantial drop in residual deviance.
library(popbio)
logi.hist.plot(alzheimers$MMSE,alzheimers$Dement, boxp=FALSE, type= "histogram", col="gray", main = "Demented vs MMSE")
logi.hist.plot(alzheimers$nWBV,alzheimers$Dement, boxp=FALSE, type= "histogram", col="gray", main = "Demented vs nWBV")
logi.hist.plot(alzheimers$Age,alzheimers$Dement, boxp=FALSE, type= "histogram", col="gray", main = "Demented vs Age")
logi.hist.plot(alzheimers$eTIV,alzheimers$Dement, boxp=FALSE, type= "histogram", col="gray", main = "Demented vs eTIV")
logi.hist.plot(alzheimers$Sex,alzheimers$Dement, boxp=FALSE, type= "histogram", col="gray", main = "Demented vs Sex")
exp(coef(model.final))
## (Intercept) Age MMSE nWBV Sex
## 2.900920e+28 8.634013e-01 2.973199e-01 1.456535e-12 2.557408e+00
## EDUC
## 9.222947e-01
Y = 65.537 - 0.147(Age) - 1.213(MMSE) - 27.255(nWBV) + 0.939(Sex) - 0.081(EDUC)
where Y = logit(p) = log(p / (1 - p))
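To make the log-odds scale concrete, here is a small sketch that plugs a hypothetical subject into the fitted equation (the coefficients are copied from the model.final summary above; the subject's values are invented for illustration):

```r
# coefficients from the model.final summary above
b <- c(intercept = 65.53741, Age = -0.14688, MMSE = -1.21295,
       nWBV = -27.25496, Sex = 0.93899, EDUC = -0.08089)
# hypothetical subject: 80-year-old male, MMSE 22, nWBV 0.70, 12 years of education
x <- c(1, 80, 22, 0.70, 1, 12)
logit <- sum(b * x)  # linear predictor on the log-odds scale
p <- plogis(logit)   # inverse logit: 1 / (1 + exp(-logit))
p                    # about 0.9997: a high predicted probability of dementia
```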
Use model.final to make predictions on the test dataset.
results <- predict(model.final,newdata=test,type='response')
results_decision <- ifelse(results > 0.5,1,0)
plot(results,test$Dement,main="Predicted Results")
misClasificError <- mean(results_decision != test$Dement)
accuracy <- 1-misClasificError
print(paste('Accuracy',accuracy))
## [1] "Accuracy 0.875"
The final model produces a probability of dementia for each of the 64 observations in the test set. We use a threshold of p = 0.5 to classify subjects: probabilities greater than 0.5 represent a prediction of dementia, while probabilities less than 0.5 predict a nondemented patient. Comparing the predicted values to the actual values in the test set, the model yields a prediction accuracy of 87.5%. This suggests the model is successful at predicting dementia in a new set of subjects.
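Accuracy alone can hide where the model errs; a confusion matrix separates false positives from false negatives. A minimal sketch with illustrative vectors (in the actual analysis, results_decision and test$Dement would take their place):

```r
# illustrative stand-ins for results_decision and test$Dement
predicted <- c(1, 0, 0, 1, 1, 0, 0, 0)
actual    <- c(1, 0, 0, 0, 1, 1, 0, 0)
confusion <- table(Predicted = predicted, Actual = actual)
confusion                               # off-diagonal cells are the errors
sum(diag(confusion)) / sum(confusion)   # overall accuracy: 6/8 = 0.75
```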
ROC and AUC
An ROC curve is generated to find the AUC (area under the curve), which summarizes the model's predictive ability. The closer the AUC is to 1, the better the model discriminates between classes; with an AUC of 0.964, our model shows good predictive ability.
#note: this refits the model on the test set rather than scoring model.final's
#predictions, so the AUC below reflects an in-sample fit to the test data
alzheimers_roc <- glm(Dement ~ Age + MMSE + nWBV + Sex + EDUC, data=test, family = "binomial")
result1 <- predict(alzheimers_roc, type = c("response"))
library("ROCR")
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred <- prediction(result1,test$Dement)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = rainbow(7), main = "ROC curve Demented")
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.9635417
While we were generally satisfied with the results, we would naturally have preferred a larger dataset for building our model. Its accuracy would benefit from a greater number of participants, since the OASIS sample was somewhat small. We also would have liked to examine more potential predictors, but were limited by what was available in the OASIS dataset. For example, variables concerning diet, exercise, smoking, and so on could have been relevant to the model, and would likely have been easy to collect from participants at the time of their MRI scans.
Acknowledgements
alz.org
When publishing findings that benefit from OASIS data, please include the following grant numbers in the acknowledgements section and in the associated Pubmed Central submission: P50 AG05681, P01 AG03991, R01 AG021910, P20 MH071616, U24 RR0213